PATENT
11670/999 



METHODS FOR DETERMINING THE BIOCHEMICAL AND BIOPHYSICAL PROPERTIES OF PROTEINS



FIELD OF THE INVENTION 






The present invention relates to a method of determining the biochemical or biophysical 
properties of a protein from its amino acid sequence. The invention further relates to a method 
for optimizing high-throughput protein expression and protein structure determination. 
Additionally, the present invention relates to a method for optimizing the screening of proteins as 
potential drug-targets and optimizing drug discovery techniques. 

BACKGROUND OF THE INVENTION 



Genome sequencing projects are providing vast amounts of information. With the whole
genome of many organisms, including humans, complete or nearing completion, the next
challenge involves the characterization of these gene products, proteins. The characterization of
proteins has focused on both the analysis of the 3-dimensional structure and of the corresponding
function.



The completion and near completion of the sequencing phase of genome projects has
ushered in the age of proteomics, the study of all gene products in an organism. This flood of
sequence information, coupled with recent advances in molecular and structural biology, has also
led to the concept of "structural proteomics" or "structural genomics", the determination of
3-dimensional protein structures on a genome-wide scale. An important use of 3D-structural
information of proteins is to uncover clues to protein function that are not detectable from
sequence analysis. This application of structural proteomics is driven by the realization that
fewer than 30% of all predicted eukaryotic proteins have a known function.

While useful, analysis of the DNA sequence alone generally does not allow one to infer
the structure or function of gene products unless the sequence has high homology to another gene
of known function. Gene sequence information does not provide a complete and accurate profile
of protein function or structure. After transcription from DNA to RNA, the mRNA transcript
may be spliced in different ways prior to translation into the protein. Following translation, many
proteins are modified, for example by the addition of one or more carbohydrate or phosphate
groups. These modifications are important to the structure and function of the protein but are not
directly coded by that protein's gene. Thus, a single gene can code for many protein products.
As a consequence, the proteome is far more complex than the genome.

The function of a protein derives from its 3-dimensional structure. The sequence
information alone is generally insufficient to provide a detailed picture of a protein's structure
and function. The 3-dimensional structure of a protein generally provides more information
about function than does the sequence of the protein. Proteins with little sequence homology but
high structural homology have often been found to have similar biochemical functions. The
function of a protein often involves interaction with a small molecule, another protein or other
biomolecule, such as a lipid, sugar, or nucleic acid. The interaction of the protein with its target
molecule is determined by amino acid residues which are close in space due to the protein's
3-dimensional structure, allowing those residues to simultaneously interact with the target
molecule. However, these amino acids may be distant according to the linear amino acid
sequence.

To predict a function for a new protein, the amino acid sequence of the predicted protein
coding region, or open reading frame (ORF), is compared against all functionally assigned
sequences in protein sequence databases. If significant sequence or motif homology is found
between the ORF and a sequence of known function from the protein sequence database, it is
assumed that the two sequences share the same, or similar, function. Unfortunately, most ORFs
share little, no, or only partial homology with a functionally assigned sequence. Thus, a large
proportion of new ORFs are found to encode proteins of unknown function. In addition, for
those ORFs that harbor some homology to another sequence, often the region of homology
comprises only a small fraction of the total sequence, leaving the rest unknown.

The function of a new protein can often be predicted by determining its 3-dimensional
structure using nuclear magnetic resonance (NMR) or X-ray crystallography. The structure,
rather than the amino acid sequence, is then compared to known protein structures of assigned
function. This information is collected in the Protein Data Bank (PDB), which can be searched
to find homologous structural features of known proteins. If structural homologs are found, the
new protein may be predicted to have a function similar to that of the homolog. In many cases
the predicted function can be readily confirmed experimentally. This method
has the potential to be far more reliable than primary sequence comparisons, as proteins with
little sequence homology may adopt similar 3-dimensional conformations that impart similar
function. However, to date the PDB contains relatively few unique protein structures
(<2000), giving the database limited predictive power.

A related use of structural proteomics information is to determine a sufficient number of
three-dimensional structures necessary to define a "basic parts list" of protein folds. Most other
structures could then be modeled from this basis-set using computational techniques. This
analysis becomes feasible when a sufficient number of high-resolution, three-dimensional protein
structures have been determined to establish rules of how all proteins fold into functional
biological macromolecules. The long term goal is to determine experimental structures for all
proteins because it is the subtle differences in protein structure that contribute to the diversity and
complexity of life, and current modeling techniques are not yet accurate enough to reveal these
subtleties.

As protein structure is fundamental to molecular biology and disease, structural
proteomics will have an impact on many areas of biology, including drug development.
Application of structural proteomics to the pharmaceutical industry includes providing protein
structural information for drug development, including identification and/or validation of new
drug targets.

Historically, the explosion in gene sequence information has far outpaced the
characterization of gene products. The processes of expressing and purifying proteins have
represented a bottleneck in the efforts to obtain protein samples for 3-dimensional structure
determination by NMR and X-ray crystallography. Current methods to express and purify
proteins for structural determination are performed on a protein-by-protein basis with relatively
low throughput. Generating high quality samples for structure determination by NMR or by
X-ray crystallography (a well-behaving NMR sample or a well-diffracting crystal, respectively) is
also a bottleneck in these efforts. An essential element for the success of structural proteomics is
the development of high-throughput methods for protein expression, purification, sample
preparation and structural determination.

The invention provides for a database of protein sequence information and experimentally 
determined biochemical and biophysical properties of proteins. The database is analyzed using 
data-mining techniques to find correlations among protein sequence information, biochemical 
properties and biophysical properties. The correlations provide predictive rules relating a 
protein's sequence to its biochemical and biophysical behavior. Using the correlations obtained 
from the data-mining techniques, the properties of new proteins are determined given their amino 
acid sequence information. This allows the optimization of the conditions necessary for high- 
throughput techniques, such as expression, purification, crystallization, NMR-sample 
preparation, structure determination and screening for binding to other molecules. The high- 
throughput analysis of protein structure is intrinsic to the success of structural proteomics. Such 
high-throughput analysis is also applied to techniques for screening proteins as potential drug- 
targets. The predictive rules are useful for the optimization of conditions for drug discovery, 
greatly accelerating rational drug development. 

SUMMARY OF THE INVENTION 

It is an object of this invention to provide a general strategy for determining relationships 
among a protein's biochemical properties, biophysical properties and sequence. With the 
extraordinary volume of gene sequence information identified by the genome projects, the 
volume of protein sequence information will increase rapidly. 

It is an object of this invention to provide a database of protein sequence information and
experimentally determined protein properties. The database can be analyzed using data-mining
techniques to find correlations among protein sequence information, biochemical properties and
biophysical properties. Using the empirical correlations obtained from the data-mining
techniques, the properties of new proteins are determined given their amino acid sequence
information alone or using a combination of the sequence information and one or more
properties.

Another object of the invention is to provide a method of determining the properties of a 
protein using the sequence information alone or in combination with one or more properties by 
applying the correlations obtained by the analysis of this database. 

A further object of the invention is to provide a method for optimizing high-throughput
protein structure determination. Using the predictive power of the empirical database in
conjunction with data-mining tools, and the correlations obtained therefrom, the biochemical and
biophysical properties of new proteins can be predicted. Based upon these predictions,
experimental conditions for the analysis of a protein, or class of proteins, are modified so that the
analysis can be better performed. Conversely, the invention provides a screening method to
identify proteins that exhibit the desired properties for structural analysis by NMR or X-ray
crystallography.

A further object of the invention is to provide a method for optimizing high-throughput
methods for drug-target discovery. The invention provides a method of predicting which
proteins are amenable to investigation as drug targets, thus speeding up the drug discovery
process. The invention provides a screening method to identify proteins that exhibit the desired
properties for study as potential drug targets.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a decision tree for discriminating between soluble and insoluble proteins. The nodes
of the tree are represented by ellipses (intermediate nodes) and rectangles (final nodes or leaves).

DETAILED DESCRIPTION

We have found that given a database of protein sequence and experimentally determined
protein properties, we are able to derive a set of rules from protein sequence information that are
predictive of a given protein's biophysical and biochemical properties. The proteins may include
naturally occurring proteins, modified proteins, synthetic proteins and subdomains of proteins. It
is often easier to work with the smaller subdomain of a given protein, for example in drug
screening or drug design, as well as in structure determination.

The database is constructed from protein sequence information and experimental data on
protein biophysical and biochemical properties. The protein sequence information includes the
primary amino acid sequence and characteristics which are directly derived from the sequence,
including amino acid composition, the character of a region of the sequence, hydrophobicity,
charge, molecular weight, the presence and length of low complexity regions and the presence of
sequence motifs found in other proteins. The amino acid composition includes such information
as the percent of a specific amino acid present in the sequence, the percent of a combination of
two or more amino acids, and the percent of amino acids of a general class (such as, but not
limited to, hydrophobic, hydrophilic, aromatic, aliphatic, acidic, basic, charged, and the like).
Regions having a particular character may be, for example, regions of low sequence complexity,
regions that are hydrophobic/hydrophilic, or charged regions (positive or negative). The
sequence information is derived from the genomic DNA sequence, cDNA sequence, or
synthetic DNA. The primary sequence information may come from any source, including
human, animals, plants, yeast, bacteria, virus or engineered proteins.
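
By way of illustration only, the sequence-derived characteristics described above could be
computed along the following lines. This is a minimal sketch and not the implementation used to
build the database: the function name `sequence_features`, the membership of each residue class,
the simple lysine/arginine minus aspartate/glutamate charge estimate, and the use of the longest
single-residue run as a crude stand-in for a low complexity region are all assumptions made for
the example.

```python
from collections import Counter
from itertools import groupby

# Assumed class membership (one-letter codes); the text names the classes
# but does not fix which residues belong to each.
HYDROPHOBIC = set("AVLIMFWC")
AROMATIC = set("FWY")
ACIDIC = set("DE")
BASIC = set("KRH")


def sequence_features(seq: str) -> dict:
    """Derive simple composition features from a one-letter amino acid sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)

    def pct(residues) -> float:
        # Percent of positions occupied by any residue in `residues`.
        return 100.0 * sum(counts[r] for r in residues) / n

    return {
        "length": n,
        # Percent of each specific amino acid present in the sequence.
        "pct_by_residue": {aa: 100.0 * c / n for aa, c in counts.items()},
        # Percent of amino acids of a general class.
        "pct_hydrophobic": pct(HYDROPHOBIC),
        "pct_aromatic": pct(AROMATIC),
        "pct_acidic": pct(ACIDIC),
        "pct_basic": pct(BASIC),
        # Crude charge estimate: ignores histidine, termini and pH dependence.
        "net_charge_estimate": counts["K"] + counts["R"] - counts["D"] - counts["E"],
        # Crude proxy for low complexity: longest run of one repeated residue.
        "longest_single_residue_run": max(len(list(g)) for _, g in groupby(seq)),
    }


features = sequence_features("MKKDEEAAFF")
```

Each returned value can be stored as one column of the database alongside the experimentally
determined properties, so that the data-mining step can relate the two.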

The biophysical properties which populate the database include, for example, thermal
stability, solubility, isoelectric point, pH stability, crystallizability, conditions of crystallization,
aggregation state, heat capacity (ΔCp), resistance to chemical denaturation, resistance to
proteolytic degradation, amide hydrogen exchange data, behavior on chromatographic matrices,
electrophoretic mobility and resistance to degradation during mass spectrometry. Biophysical
properties may also include amenability (suitability) for study by various investigative
techniques, including nuclear magnetic resonance (NMR), X-ray crystallography, circular
dichroism (CD), light scattering, atomic absorption, fluorescence, fluorescence quenching, mass
spectrometry, infrared spectroscopy (IR), electron microscopy, atomic force microscopy and any
results obtained from these techniques. For each property, the conditions under which the
property was determined are incorporated into the database. These conditions may include solvent
choice, protein concentration, buffer components and concentration, pH, temperature and salt
concentration. It is advantageous to record a protein's properties determined under a variety of
experimental conditions. Additional proteins are studied using the same set of conditions. In all
cases where it is applicable, negative information is recorded in the database (for example,
insolubility, unsuitability for study by NMR, etc.). To ensure uniformity of the data collected, it is
preferred to perform the biophysical measurements on proteins that have been purified. It is
especially preferable that the proteins are at least about 95% pure.

Among the biophysical properties which may be included in the database are those that
relate to X-ray crystallographic techniques. These properties include conditions under which a
protein does or does not crystallize, including solvents, precipitants, buffer components and
concentration, pH, temperature, and salt concentration. The properties also include any results
obtained from the X-ray crystallography studies, including three dimensional structure,
characteristics of the crystal, including space group, solvent content, unit cell parameters, crystal
contacts, solution conditions and bound water, and substrate binding. Additionally, the database
may include how the various conditions employed affect the results that are obtained.

The biochemical properties which compose the database include expressability, or level
of expression, in various vectors and hosts with various fusion tags and under various conditions,
such as temperature and medium composition, the protein yield obtained from various vectors
and hosts under various conditions, results of small molecule binding screens, subcellular
localization, demonstrated utility as a drug target, and knowledge of protein-protein or
protein-ligand interactions. A biochemical property of particular interest is the protein's potential as a
drug target.

An important aspect of the method of the invention is to have large numbers of proteins
examined and compared under uniform conditions. The advent of high-throughput cloning and
expression techniques and of high-throughput protein purification techniques has contributed to
the feasibility of collecting this large volume of information. In theory, one might be able to
compile the type of data listed above on a larger number of proteins from published accounts in
the literature. However, data from literature sources are not acquired under "standard" or uniform
conditions. Furthermore, it is hard to assess the quality control or to fully ascertain the
experimental conditions in many literature papers. Therefore, such a literature database would
inherently yield less reliable predictions. For example, one can find data on protein yields from
E. coli expression for many proteins. However, the conditions of growth (length of incubation
time, temperature, induction conditions, etc.) are variable and can have effects on the
experimental result. Thus, correlations between protein characteristics and expressability based
on such data would be unreliable. Additionally, the intrinsic noise or scatter in the data will
mask more subtle correlations. Furthermore, negative results are generally not reported in the literature,
and these are just as important to record in the database and use in the data mining (for example:
under what conditions is a given protein not soluble, or does a given protein not crystallize).

According to the invention, to ensure uniformity of the data, the biophysical and
biochemical data are collected using a uniform set of conditions or experimental procedures. The
conditions under which the empirical data are collected are important and are recorded in the
database. Ideally, multiple conditions are recorded for each type of measurement. The conditions
of the data collection (temperature, solution components, salt concentration, buffer, pH) can
drastically affect the behavior of a given protein. Therefore, it is desirable to compare many
proteins under the same set of conditions, so that the only variable is the protein sequence.
Alternatively, one can compare a variety of conditions for a given protein (or set of proteins) and
relate that to sequence features.
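
One way to keep such comparisons restricted to a single set of conditions is to group the
recorded measurements by their conditions before any analysis. The following is a sketch under
stated assumptions: the function name `group_by_conditions`, the tuple layout of a measurement,
the condition keys and the protein identifiers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_by_conditions(measurements):
    """Group (protein_id, conditions, value) triples by identical conditions.

    Within each returned group the conditions are constant, so differences in
    the measured value can be attributed to the proteins themselves.
    """
    groups = defaultdict(dict)
    for protein_id, conditions, value in measurements:
        # Sort the items so that the key does not depend on dict ordering.
        key = tuple(sorted(conditions.items()))
        groups[key][protein_id] = value
    return dict(groups)


# Hypothetical solubility scores (0-5) for two proteins.
measurements = [
    ("P1", {"pH": 7.0, "temp_C": 25}, 4),
    ("P2", {"temp_C": 25, "pH": 7.0}, 0),
    ("P1", {"pH": 5.0, "temp_C": 25}, 1),
]
groups = group_by_conditions(measurements)
```

Reading across one group compares many proteins under the same conditions; reading one protein
across groups gives the alternative comparison of several conditions for a given protein.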

In order to mine this data, it is annotated in the database using a "controlled vocabulary". 
For example, data entry for solubility could be either a number, such as a quantitative 
measurement (for example, solubility in mg/ml), or a qualitative numerical scale (for example, a 
scale of 0-5, with 0 being completely insoluble, and 5 being very soluble). Direct instrumental 
measurements can also be used if internal calibration standards are used, so that the values can be 
related to some standard. 
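
A controlled vocabulary of this kind can be enforced when a record is created. The sketch below
is illustrative only: the class names `Conditions` and `SolubilityRecord`, the field names and
the example identifier are assumptions, while the 0-5 scale and the mg/ml unit follow the
description above.

```python
from dataclasses import dataclass
from typing import Optional

# Qualitative scale from the text: 0 = completely insoluble, 5 = very soluble.
SOLUBILITY_SCALE = range(0, 6)


@dataclass(frozen=True)
class Conditions:
    """Conditions under which a measurement was made (assumed field set)."""
    ph: float
    temperature_c: float
    salt_mM: float
    buffer: str


@dataclass(frozen=True)
class SolubilityRecord:
    """One solubility entry; at least one of the two value fields must be set."""
    protein_id: str
    conditions: Conditions
    score: Optional[int] = None        # qualitative 0-5 scale
    mg_per_ml: Optional[float] = None  # quantitative measurement

    def __post_init__(self):
        if self.score is None and self.mg_per_ml is None:
            raise ValueError("record needs a score or a mg/ml value")
        if self.score is not None and self.score not in SOLUBILITY_SCALE:
            raise ValueError("score must be an integer from 0 to 5")
        if self.mg_per_ml is not None and self.mg_per_ml < 0:
            raise ValueError("mg/ml cannot be negative")


# A negative result (insoluble) is recorded like any other entry.
record = SolubilityRecord(
    protein_id="ORF_0001",  # hypothetical identifier
    conditions=Conditions(ph=7.5, temperature_c=25.0, salt_mM=150.0, buffer="HEPES"),
    score=0,
)
```

Rejecting out-of-vocabulary values at entry time keeps the later data-mining step from having to
reconcile free-text annotations.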

As a sufficient quantity of data is compiled in the database, the data can be analyzed
using data-mining techniques, or knowledge discovery tools, to find correlations among protein
sequence information and biochemical or biophysical properties. These correlations provide
predictive rules for general protein behavior. The correlations may link protein sequence
information alone, or in combination with one or more biochemical or biophysical properties, to
a certain characteristic or a set of characteristics. Using the correlations obtained from the
data-mining techniques, the properties of new proteins are determined given their amino acid
sequence information alone or using a combination of the sequence information and one or more
empirical properties.

Data-mining techniques, or knowledge discovery tools, are computer algorithms and
associated software for identifying relationships between elements of the database, particularly
relationships to protein sequence features. Data-mining techniques include, for example,
decision-tree analysis, case-based reasoning, Bayesian classifiers, simple linear discriminant
analysis, and support vector machines. We found decision tree analysis to be most useful for
comprehensibly summarizing the multivariate data and for developing prediction rules of a
relatively small database of protein solubility and crystallization data.
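
To make the idea of decision-tree analysis concrete, the following sketch grows a small binary
tree that separates soluble from insoluble proteins using numeric sequence features. It is a
generic illustration, not the procedure used to produce Figure 1: the Gini impurity criterion,
the function names, the feature names and the four-row toy data set are all assumptions, and the
toy values are invented solely to exercise the code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for feature in rows[0]:
        values = sorted({r[feature] for r in rows})
        # Candidate thresholds lie midway between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[feature] <= t]
            right = [y for r, y in zip(rows, labels) if r[feature] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (feature, t), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Recursively build a tree; leaves are labels, inner nodes are dicts."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    feature, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > t]
    return {
        "feature": feature,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    }


def predict(tree, row):
    """Follow the tree from the root to a leaf and return the leaf label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Invented toy data, for illustration only (not experimental results).
rows = [
    {"pct_hydrophobic": 30, "pct_acidic": 15},
    {"pct_hydrophobic": 35, "pct_acidic": 12},
    {"pct_hydrophobic": 55, "pct_acidic": 5},
    {"pct_hydrophobic": 60, "pct_acidic": 8},
]
labels = ["soluble", "soluble", "insoluble", "insoluble"]
tree = build_tree(rows, labels)
```

Each path from the root to a leaf reads as a rule of the form "if feature is at or below a
threshold, then predict this class", which is the sense in which a decision tree summarizes
multivariate data comprehensibly.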

The predictive nature of the invention allows one to preemptively adjust experimental
conditions to optimize, for example, cloning techniques, protein expression techniques,
purification techniques and protein structure determination techniques. Thus, the invention
provides a method for optimizing high-throughput protein structure determination. Using the
predictive power of the empirical database in conjunction with data-mining tools, and the
correlations obtained therefrom, the biochemical and biophysical properties of new proteins are
predicted. Based upon these predictions, experimental conditions for the analysis of a protein, or
class of proteins, are modified. Conversely, the invention provides a screening method to identify
proteins that exhibit the desired properties for structural analysis or for use as a substrate for
high-throughput drug screening. By the method of the invention, the biochemical or biophysical
properties of new proteins are determined. Proteins that are determined to have a desired
property or properties are then selected for further analysis. In this way, optimal proteins can be
selected based on properties including one or more of crystallizability, suitability for NMR,
expressability in a certain vector, solubility, suitability for study by a certain investigative
technique and suitability for drug screens.
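
The selection step described above amounts to filtering proteins on their predicted properties.
A minimal sketch follows; the function name `select_candidates`, the property names and the
protein identifiers are hypothetical, and the predictions shown are placeholders rather than
results.

```python
def select_candidates(predictions, required):
    """Return IDs of proteins whose predicted properties meet every requirement.

    A protein with no prediction for a required property is not selected.
    """
    return [
        protein_id
        for protein_id, props in predictions.items()
        if all(props.get(name) == wanted for name, wanted in required.items())
    ]


# Placeholder predictions for three hypothetical proteins.
predictions = {
    "ORF_A": {"soluble": True, "crystallizable": True},
    "ORF_B": {"soluble": True, "crystallizable": False},
    "ORF_C": {"soluble": False, "crystallizable": True},
}
xray_targets = select_candidates(predictions, {"soluble": True, "crystallizable": True})
```

The same function serves different downstream uses by changing only the `required` mapping, for
example requiring solubility alone when choosing substrates for a drug screen.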

The method of the invention speeds up the high-throughput structure determination
process. The three dimensional structure of a protein can reveal whether it is likely to be a good
drug target. Good drug targets generally have deep, often hydrophobic, clefts or grooves on their
surface or at their active sites where small molecule drugs can bind with high affinity. Poor drug
targets have shallow grooves or otherwise poor surface properties that do not allow for high
affinity binding of small molecules. By rapidly identifying which proteins have surface
properties that make them promising for drug binding, the method greatly facilitates the drug
discovery process.

The invention provides a method to identify proteins that exhibit desired biochemical
properties for drug interaction. Such biochemical properties may include the propensity to bind
or interact with certain small molecules such as, for example, hydrophobic compounds,
carbohydrates, or metal ions, or certain classes of drugs, pesticides, herbicides, or insecticides. By
the method of the invention, the properties of new proteins are determined. Proteins that are
determined to have a desired property or properties are then selected for further analysis. The
screening of proteins as potential drug targets allows the researcher to selectively study proteins
that are predicted to have desired biochemical or biophysical properties, thus reducing the
research time and costs while greatly increasing the chance of success. The invention provides a
method of predicting which proteins are amenable to investigation as drug targets, thus speeding
up the drug discovery process.

For example, the method of the invention allows us to predict from protein
sequence information which proteins will be soluble and stable, a requirement for
high-throughput screening of drug-target candidates. Thus, it greatly facilitates the development of
high-throughput screening methods. Additionally, it will allow us to predict which proteins will
crystallize, and under what conditions, and which proteins will be amenable to NMR structure
determination. The structure of a protein is useful in designing inhibitors or drugs that target that
protein. The invention provides a rapid method of predicting which proteins are amenable to
structure determination, thus speeding up the drug discovery process. In addition, the method of
the invention will tell us which sequence features make a protein less amenable to structure
determination, or less soluble and less stable. Thus, it provides the necessary knowledge to make
point mutations, allowing the production of an analogous protein that will be more amenable to
structure determination, or more soluble and more stable, again facilitating the target
identification, validation and high-throughput screening and drug design processes. Certain
classes of proteins, such as a specific enzyme class, may exhibit unique biochemical or
biophysical properties. The invention would allow the creation of "class-specific"
characteristics, which would allow us to discover new members of the class or to modify
members of the class to improve their activity, solubility or suitability for structure
determination.

The more proteins and characteristics compiled in the database, the greater the predictive
power achieved from the rules derived from the data-mining. For this reason the use of
high-throughput techniques in the assembly of the database is desirable. The wide availability of
recombinant DNA technology makes it feasible to generate expression systems that can produce
large quantities of a selected protein. The steps for protein production may include: generation of
the protein expression systems, overexpressing the protein and purifying the protein.

The generation of a clone for any particular gene of interest, and its incorporation into a
suitable expression vector, is now a straightforward task that can be done in a parallel fashion for
high-throughput production. The selection of target proteins for structural analysis from
completely sequenced genomes can take advantage of the availability of these cloned genes.
However, even if a clone of a particular protein of interest is not readily available, it has now
become a routine operation to generate a cDNA clone for almost any particular protein from a
wide variety of organisms.

To obtain expression of a cloned nucleic acid in bacteria, the expression vector
contains a strong promoter to direct transcription, a transcription/translation terminator,
and, if the nucleic acid encodes a peptide or polypeptide, a ribosome binding site for translational
initiation. Suitable bacterial promoters are well known in the art and described, e.g., in
Sambrook et al. and Ausubel et al. Bacterial expression systems are available in, e.g., E. coli,
Bacillus sp., and Salmonella (Palva et al., Gene 22:229-235 (1983); Mosbach et al., Nature
302:543-545 (1983)). Kits for such expression systems are commercially available. Eukaryotic
expression systems for mammalian cells, yeast, and insect cells are well known in the art and are
also commercially available. In certain cases, where post-translational modifications, for
example glycosylation, are important, eukaryotic expression systems are preferred. In some
cases, it may be preferable to employ expression vectors which can be propagated in both
prokaryotic and eukaryotic cells, enabling, for example, nucleic acid purification and analysis
using one organism and protein expression using another.

Transfection methods used to produce bacterial, mammalian, yeast or insect cells or cell
lines that express large quantities of protein are well known in the art. These include the use of
calcium phosphate transfection, polybrene, protoplast fusion, electroporation, liposomes,
microinjection, plasmid vectors, viral vectors and any of the other well known methods for
introducing cloned genomic DNA, cDNA, synthetic DNA or other foreign genetic material into a
host cell (see, e.g., Sambrook et al., supra). After the expression vector is introduced into the
cells, the transfected cells are cultured under conditions favoring expression of the protein, which is
then purified using standard techniques.

The protein is expressed in suitable amounts for further analysis. There are several
expression systems that have been extensively studied. Some of these include: 1) bacterial (E.
coli), 2) methylotrophic yeast (Pichia pastoris), 3) viral (baculovirus, adenovirus, vaccinia and
some RNA viruses), 4) cell culture (mammalian and insect), and 5) in vitro translation. Although
the expression of any particular protein may be idiosyncratic, the availability of these and other
expression systems significantly increases the ability to produce large quantities of protein.

In situations in which relatively large amounts of relatively pure protein in native form
are required, for example to obtain protein crystals useful for determination of 3D structure, it
will be desirable to employ expression systems characterized by high expression levels and efficient
protein processing, including cleavage of signal peptides and other post-translational
modifications. The baculovirus expression system is widely used to express a variety of proteins
in large quantities. In addition to fulfilling the above requirements, the size of the expressed
protein is not limited, and expressed proteins are typically correctly folded and in a biologically
active state. Baculovirus expression vectors and expression systems are commercially available
(Clontech, Palo Alto, CA; Invitrogen Corp., Carlsbad, CA).

5 

Once a protein has been expressed to an acceptable level, the protein is purified from the 
other contents of the cell system that was utilized for expression. Highly purified protein is often 
desirable for further analysis according to the method of the invention. The proteins can be 
expressed fused to tags that aid subsequent purification or measurement techniques. Typical tags 
1 0 bind specifically to particular ligands, allowing the attached protein to be purified without regard 
to its physical or biochemical characteristics. Such tags can then be cleaved, leaving the protein 
in its native form. Examples of tags include histidine rich sequences which bind to various metal 
i ions and glutathione-S-transferase (GST) tags which selectively bind to glutathione. The ligands 
si are typically attached to a solid support. The fusion proteins are bound to the immobilized ligand 
B and unbound material is removed. In certain cases, the fusion protein also includes a cleavable 
N- sequence of amino acids between the protein of interest and the tag sequence whereby the tag can 
■J* be cleaved from the protein of interest. Typically, this is accomplished with a protease that 
y cleaves the sequence under conditions where the protein of interest is not degraded, or with an 
W intein sequence, which allows for internal cleavage of the protein. Alternatively, the tags can 
W provide a method for specifically anchoring proteins to a solid support for assay purposes. For 
U example, it can be useful to anchor proteins to an assay plate in order to measure fluorescence 
and fluorescence quenching in the presence of potential ligands. In another embodiment, a solid 
support is employed which provides an array of binding surfaces to which different proteins of 
the library are anchored for use in protein-ligand and protein-protein interaction studies. The 
solid support can be, for example, a glass or plastic plate, a semi-solid or gel-like matrix or the 
surface of a semiconductor measuring device. Bacterial vectors designed for production of GST 
fusion proteins are commercially available which allow cloning of DNAs in all three reading 
frames (e.g., pGEX series of vectors; Amersham Pharmacia Biotech, Inc., Piscataway, NJ). 

The following examples are provided as illustrative of the present invention and are not limiting. 

EXAMPLE I 

To explore the feasibility of a comprehensive structural proteomics project, 424 non-membrane 
proteins of unknown structure from Methanobacterium thermoautotrophicum are 
cloned, expressed in E. coli and purified. Using a single high-throughput protocol, about 20% of 
these are found to be suitable candidates for x-ray crystallographic or NMR spectroscopic 
analysis without further optimization of conditions, providing an estimate of the number of the 
most readily accessible structural targets in a proteome. A retrospective analysis of the empirical 
characteristics, including the experimental behavior, of these proteins provides some simple 
relations between sequence and biochemical and biophysical properties. A comprehensive 
database of protein properties is useful in optimizing high-throughput strategies. 

Target selection 

M.th. is a thermophilic Archaeon whose genome comprises 1871 Open Reading Frames (ORFs) (ref. 7). 
Archaeal proteins share many sequence and functional features with eukaryotic proteins, but are 
often smaller and more robust, and thus serve as excellent model systems for complex processes. 
Only two exclusionary criteria are implemented in our target selection scheme. First, 
membrane-associated proteins, which comprise approximately 30% (267-422 of 1871 ORFs) of the M.th. 
proteome, are excluded. Second, proteins that have clear homologues in the PDB are excluded 
(approximately 27% of M.th. proteins). The remaining proteins (~900) are not prioritized based 
on their probability of having a new fold, nor in terms of "biological relevance". We chose to 
invest our effort in developing high-throughput methods to generate a large collection of proteins 
to test as candidates for structural analysis, rather than concentrating on a small set of "high 
priority" targets, a large proportion of which may not be amenable for immediate structural 
analysis. Thus, 424 of the 900 final target M.th. proteins (almost a quarter of the entire proteome 
and a third of the non-membrane proteins) are chosen for cloning, expression and subsequent 
studies. These represent an unbiased sampling of non-membrane proteins from a single proteome 
with 34% having a functional annotation, 54% classified as "conserved" and 12% as "unknown". 
This diverse collection of proteins is particularly valuable for retrospective analysis aimed at 
identifying sequence features that are predictive of protein biophysical and biochemical behavior. 

Cloning strategy 

Each target gene is PCR-amplified from genomic DNA under standard, but optimized, 
conditions, with terminal incorporation of unique restriction sites, using high-fidelity Pfu DNA 
polymerase (Stratagene). The PCR products are directionally cloned into the pET15b bacterial 
expression vector (Novagen). The resulting plasmid encodes a fusion protein with an N-terminal 
hexa-histidine tag followed by a thrombin cleavage site. In the interest of throughput, no 
other expression vectors or organisms were used. 

A single PCR protocol and set of cloning conditions are optimized for M.th. based on an 
analysis of an initial set of 50 genes. Positive clones are confirmed by colony PCR screening 
using Taq DNA polymerase. The generic nature of the procedure resulted in some PCR and 
sub-cloning failures, leading to a cumulative attrition rate of ~6%. This protocol is readily scalable to 
96-well format and has been extended to alternative vectors and expression organisms. 

Expression strategy 

The M.th. open reading frames are divided arbitrarily into two groups, "large" (>20 kDa 
monomer size) and "small" (<20 kDa). Large proteins are processed for crystallization trials and 
small proteins for NMR feasibility studies. Most (~80%) successfully cloned M.th. proteins could 
be expressed in E. coli BL21-Gold (DE3) cells (Stratagene), although efficient expression often 
requires the presence of a second plasmid encoding three tRNAs which are frequently used by 
archaea and eukaryotes but are rare in E. coli. While most proteins are expressed to reasonable 
levels, many are not expressed in soluble form (<0.5 mg/L soluble protein), especially in the case 
of the larger proteins. It is possible to reduce the attrition rate due to poor solubility by 
optimizing the expression conditions for each clone. However, in the interest of throughput we 
used a single set of growth conditions optimized for the majority of proteins. 

Purification and crystallization of large proteins 

For large proteins, three colonies from each transformation are tested for protein 
expression on a small scale (50 mL). Proteins found to be soluble by SDS-PAGE analysis of the 
bacterial extract are prepared on a larger scale (2 L). These proteins are purified by a 
combination of heat-treatment (55 °C) and nickel affinity chromatography, followed by thrombin 
cleavage and removal of the hexa-histidine tag. The heat treatment causes a significant 
enrichment of many, but not all, M.th. proteins. The purification of the proteins is monitored by 
denaturing gel electrophoresis and occasionally by mass spectrometry. Proteins that survive the 
purification process (~75%) are concentrated to 10 mg/ml and subjected to a sparse-matrix 
crystallization screen of 48 conditions at room temperature (Matrix screen 1; Hampton 
Research). For each protein that crystallizes in the initial screen, conditions are further optimized 
using an expansion of related solution conditions (typically 18-20 screens of 24 conditions for 
each protein). We chose 24 of the proteins that formed crystals in the primary screen to follow 
up with optimization screens. Of these, 11 formed well-diffracting crystals (< 3.0 Å). The 
implementation of automated methods for setting up and monitoring crystal screens can improve 
the throughput of this process. 

Purification and NMR screening of small proteins 

The smaller proteins (<20 kDa predicted monomer size) destined for NMR analysis are 
expressed five at a time, each in 1 L of 15N-enriched minimal media, and purified in parallel using 
metal affinity chromatography. The resulting 15N-labeled hexa-histidine fusion proteins are 
concentrated by ultrafiltration to ~5-20 mg/ml, and the 15N-HSQC NMR spectrum is taken at 
25 °C. The HSQC spectra are classified into one of three categories. The first, termed "excellent" 
and indicative of soluble, globular proteins, contained the predicted number of dispersed peaks of 
roughly equal intensity. These excellent spectra suggest that the process of determining their 3D 
structure is relatively straightforward. The second type of spectrum, termed "promising", had 
features such as too few or too many peaks and/or broad but dispersed signals. This suggests that 
optimization of either the protein construct or the solution conditions would be needed to yield 
an excellent sample. The last category, termed "poor", comprises two kinds of spectra. The first, 
which have intense peaks but with little dispersion in the 15N-dimension, most likely reflects 
proteins that are soluble yet largely unfolded. The second class has very low signal-to-noise 
and/or a single cluster of very broad peaks in the center of the spectrum. This class probably 
represents proteins which aggregate nonspecifically at concentrations required for NMR 
spectroscopy and thus are not readily amenable to structural analysis. For the 100 soluble 
proteins tested, the ratio of excellent/promising/poor spectra was 33/10/57. 

Of the 33 proteins showing excellent spectra, seven are initially chosen for more detailed 
structure determination using NMR spectroscopy. For these samples the his-affinity tag is 
removed by proteolytic cleavage; this does not markedly change the spectral properties of the 
proteins, suggesting that this step may be omitted in the interest of saving time and maximizing 
protein yield. In one case (MTH40) it was necessary to further optimize solution conditions in 
order to prepare a sample that was stable for the time period (several weeks) necessary for NMR 
data collection. 

EXAMPLE II 

Analysis of Protein Folding and Stability by Circular Dichroism (CD) Spectroscopy: 

To explore how other spectroscopic techniques might aid in the identification of proteins 
suitable for detailed structural analysis, CD experiments are performed on 100 of the small, 
soluble M.th. proteins. Of the 28 proteins with excellent NMR spectra that are examined, all but 6 
displayed CD spectra that were typical of folded proteins containing a significant fraction of 
α-helical and/or β-sheet secondary structure. The six atypical spectra may have resulted from 
unusual structural features of the proteins in question (e.g. small β-sheet proteins like SH3 
domains possess very unusual CD spectra). Interestingly, 24 out of 32 proteins classified as 
"aggregated" by NMR spectroscopy display CD spectra consistent with stable, folded proteins. 
This suggests that the aggregation mechanism for many of the NMR samples may be due to 
surface interactions in the folded state, as opposed to aggregation of the exposed hydrophobic 
cores of unfolded proteins. Knowledge of the aggregation mechanism is useful for optimizing 
solution conditions that disfavor aggregation; therefore, CD provides a useful secondary 
screen in structural proteomics projects. 

To better understand the contribution of protein stability to sample behavior, the thermal 
unfolding of 60 folded M.th. proteins is analyzed. Of these, 22 are unfolded and refolded in a fully 
reversible manner. However, among the 19 proteins with "excellent" NMR spectra that are tested 
in this manner, only 9 refold reversibly. The others precipitate at high temperatures, 
demonstrating that even among well-folded, small, soluble proteins, reversible thermal unfolding 
in vitro is not a ubiquitous property. Surprisingly, 8 proteins classified as "aggregated" by NMR 
are well-behaved in thermal unfolding experiments, indicating that these proteins are probably 
large discrete oligomers rather than non-specific aggregates. 

As expected for proteins from a thermophilic organism, those from M.th. all possess high 
thermostability with transition midpoint temperature (Tm) values between 68 °C and 98 °C. Due 
to their low change in heat capacity (ΔCp) upon unfolding, small proteins are generally expected 
to have higher Tm values compared to larger proteins (ref. 8). Here, however, we observe no correlation 
between the length of the M.th. proteins and their Tm values. The ΔCp values of small M.th. 
proteins are within the expected range as compared to a large number of other proteins that have 
been investigated (data not shown; ref. 9). These data suggest that except for their high thermal 
stability, the overall thermodynamic behavior of M.th. proteins studied here may be 
representative of proteins from mesophilic organisms. 

EXAMPLE III 

Retrospective analysis of a database of biophysical and/or biochemical properties 

These studies reveal that poor expression and solubility account for almost 60% of the 
recalcitrant proteins. To identify the parameters that contribute to this poor sample behavior (and 
other factors related to suitability for expression, purification, and structure analysis), a 
retrospective data-mining approach is applied. Sequence data from the ~424 proteins and the 
biophysical and biochemical data (expressability, crystallizability, solubility and melting 
temperature) are used to compile a database. Decision trees are useful for comprehensibly 
summarizing multivariate data and developing simple prediction rules. Growing the trees 
requires devising strategies regarding which variables (or combination of variables) to divide on, 
and what threshold to use to achieve the split. The 53 "splitting variables" used are derived from 
simple attributes of each sequence (e.g. amino acid composition, similarity to other proteins, 
measures of hydrophobicity, regions of low sequence complexity, etc.). 
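
As an illustration of how such sequence attributes might be computed, the following is a minimal sketch and not the implementation used in this example. The function names composition and min_window_mean are hypothetical, the full list of 53 variables is not reproduced, and the hydrophobicity scale is supplied by the caller (for instance a table of GES values); the two-letter scale in the usage note is a toy stand-in.

```python
from collections import Counter


def composition(seq, residues):
    """Fraction of positions in `seq` occupied by any residue in `residues`.

    `seq` uses one-letter amino acid codes, e.g. composition(seq, "DE")
    gives the combined Asp+Glu composition.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return sum(counts[r] for r in set(residues.upper())) / len(seq)


def min_window_mean(seq, scale, window=20):
    """Lowest mean scale value over all windows of length `window`.

    `scale` maps one-letter codes to per-residue values (assumed to be
    supplied by the caller). Returns None when the sequence is shorter
    than the window.
    """
    values = [scale[r] for r in seq.upper()]
    if len(values) < window:
        return None
    total = sum(values[:window])
    best = total
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        best = min(best, total)
    return best / window
```

For example, composition("DDEEAAAAAA", "DE") gives the Asp+Glu fraction of a ten-residue toy sequence, and min_window_mean("KKAAAAKK", {"A": -1.0, "K": 3.0}, window=4) returns the mean of the lowest-scoring four-residue window under the toy scale.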

The full tree classifying the proteins according to their solubility (yes/no) has 35 final 
nodes and 65% overall accuracy in cross-validated tests. However, a number of the rules encoded 
within the tree were of much better predictive value. These are highlighted in Figure 1. 
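
For illustration, the threshold search carried out for one variable at one intermediate node when growing such a tree can be sketched as follows. This is a simplified sketch under stated assumptions, not the algorithm used to build the tree of Figure 1: the names node_purity and best_threshold, the use of midpoints between observed values as candidate thresholds, the min_leaf parameter, and the tie-break on node size are all assumptions made for the example.

```python
def node_purity(labels):
    """Fraction of the majority class in a node (labels are booleans)."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def best_threshold(values, labels, min_leaf=1):
    """Search candidate thresholds for a single variable.

    values: one number per protein; labels: True for soluble, False for
    insoluble. A split is scored by its most homogeneous daughter node,
    with the size of that node as a tie-break. Returns
    ((purity, node_size), threshold), or None if no split is possible.
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2  # midpoint between neighbouring observed values
        left = [lab for v, lab in zip(values, labels) if v < thr]
        right = [lab for v, lab in zip(values, labels) if v > thr]
        if min(len(left), len(right)) < min_leaf:
            continue
        score = max((node_purity(d), len(d)) for d in (left, right))
        if best is None or score > best[0]:
            best = (score, thr)
    return best
```

To choose the splitting variable for a node, the same search would be repeated for each candidate variable and the highest-scoring result kept. A larger min_leaf discourages splits that isolate only one or two proteins.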
Figure 1 depicts a decision tree for discriminating between soluble and insoluble proteins. The 
nodes of the tree are represented by ellipses (intermediate nodes) and rectangles (final nodes or 
leaves). The numbers on the left of each node denote the number of insoluble proteins in the 
node, and are proportional to the node's dark area. Similarly, the numbers on the right denote the 
soluble proteins and are proportional to the white area. Under each intermediate node, the 
decision tree algorithm calculates all possible splitting thresholds for each of 53 variables 
(hydrophobicity, amino acid composition, etc.). It picks the optimal splitting variable and its 
threshold, in order for at least one of the two daughter nodes to be as homogeneous as possible. 
When a variable, v, is split, v<threshold is the left branch, and v>threshold is the right branch. 
The specific parameters used at each node and their thresholds for the right branches shown in 
the graph are in descending order (from top root to bottom leaves): hydrophobe > 0.85 kcal/mole 
(where "hydrophobe" represents the average GES hydrophobicity of a sequence stretch; the 
higher this value, the lower the energy transfer); cplx > 0.28 (a measure of a short complexity 
region based on the SEG program); Gln composition > 4%; Asp+Glu composition > 17%; Ile 
composition^. 6%; Phe+Tyr+Trp composition > 7.5%; Asp+Glu composition > 13.6%; 
Gly+Ala+Val+Leu+Ile composition > 42%; hydrophobe > 0.01 kcal/mole; His+Lys+Arg 
composition > 12%; Trp composition > 1.2%; and alpha-helical secondary structure composition 
> 58%. Note that two of the variables are conditioned on more than once (hydrophobe, 
Asp+Glu). The highlighted decision pathways terminate in highly homogeneous nodes (mostly 
dark is insoluble, mostly white is soluble). The shorter the decision pathway and the larger the 
number of cases in the terminal node, the less likely it is to over-fit the data. Heterogeneous 
leaves could be further split (dotted lines), improving the error rate but risking over-fitting of the 
training set. 

The usual technique for assessing the predictive success of rules suggested by the 
tree in the context of overfitting is cross-validation, where the overall data set is divided into test 
and training components. However, this technique is not optimal on the relatively small samples 
associated with each rule in these trees, as one has to leave out a substantial fraction of 
information in devising each rule. The predictive values of the highlighted decision pathways are 
evaluated using a "pessimistic estimation" procedure which assumes that the error rate at each 
node is binomially distributed, and then inflates the rate found on a tree based on all the data (by 
~2 standard deviations) to arrive at a more realistic estimate. 
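
The inflation step described above can be sketched as follows. This is a minimal illustration that assumes the standard deviation of a binomial proportion, sqrt(p(1-p)/n), is what is added to the observed rate; the function name pessimistic_error and the default z = 2.0 are assumptions for the example, and the exact procedure of the cited method may differ.

```python
import math


def pessimistic_error(errors, n, z=2.0):
    """Inflate an observed node error rate by z binomial standard deviations.

    errors: misclassified cases in the terminal node
    n:      total cases in the terminal node
    The result is capped at 1.0.
    """
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need 0 <= errors <= n and n > 0")
    p = errors / n
    sd = math.sqrt(p * (1 - p) / n)  # standard deviation of a binomial proportion
    return min(1.0, p + z * sd)
```

With hypothetical counts of 5 errors among 100 cases, the observed rate of 5% is inflated to roughly 9.4%. One limitation of this simple form is that a node with no observed errors receives no inflation, whatever its size.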



Proteins that fulfill the following sequence of four conditions are likely to be insoluble: 
(1) have a hydrophobic stretch - a long region (>20 residues) with average hydrophobicity less 
than -0.85 kcal/mole (on the GES scale); (2) Gln composition < 4%; (3) Asp+Glu composition < 
17%; and (4) aromatic composition > 7.5%. This rule has a 14% error rate in comparison to the 
default error rate of 39% for choosing a soluble protein without the aid of the tree. The 
probability that it could arise by chance is 1%, assuming one randomly chose the 24 insoluble 
proteins from the initial pool of 143 insoluble and 213 soluble proteins. These calculations are 
based on a "pessimistic estimate for errors" (ref. 11), taking the upper bound of the 95% confidence 
interval (see Fig. 3 for details). Conversely, proteins that do not have a hydrophobic stretch and 
have more than 27% of their residues in (hydrophilic) "low-complexity" regions are very likely to 
be soluble. This rule has a "pessimistic" error rate of 20% in contrast to 39% without the tree and 
a 1% probability of occurring by chance. 
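
The two rules stated above can be written as a short function once the relevant sequence features are available. This is a sketch only: the dictionary keys (min_hydrophobe, gln, asp_glu, aromatic, low_complexity) are hypothetical names chosen for the example, compositions are expressed as fractions rather than percentages, and the features are assumed to have been computed beforehand.

```python
def predict_solubility(features):
    """Apply the two highlighted rules to precomputed sequence features.

    features: dict with (hypothetical) keys
      min_hydrophobe  lowest average GES hydrophobicity over a long stretch (kcal/mole)
      gln             Gln composition (fraction)
      asp_glu         Asp+Glu composition (fraction)
      aromatic        Phe+Tyr+Trp composition (fraction)
      low_complexity  fraction of residues in low-complexity regions
    """
    has_hydrophobic_stretch = features["min_hydrophobe"] < -0.85
    if (has_hydrophobic_stretch
            and features["gln"] < 0.04
            and features["asp_glu"] < 0.17
            and features["aromatic"] > 0.075):
        return "likely insoluble"
    if not has_hydrophobic_stretch and features["low_complexity"] > 0.27:
        return "likely soluble"
    return "no rule applies"
```

Proteins that match neither pathway fall into the more heterogeneous parts of the tree, for which no confident prediction is made in this sketch.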

We also derived similar trees for expressability and crystallizability. We found that the 
composition of Asn appeared to be relevant to crystallizability. In particular, an Asn threshold of 
3.5% was able to select a set of 18 crystallizable and only one non-crystallizable protein from our 
initial set of 25 crystallizable and 39 non-crystallizable proteins. 
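
The counts quoted above can be turned into simple summary figures for the Asn rule. The following sketch only restates that arithmetic; the function name rule_summary and the choice of summary quantities are assumptions for the example.

```python
def rule_summary(selected_pos, selected_neg, total_pos, total_neg):
    """Summarize a selection rule from its counts.

    Returns (baseline, precision, recall):
      baseline   fraction of positives in the initial set
      precision  fraction of positives among the selected proteins
      recall     fraction of all positives that the rule selects
    """
    baseline = total_pos / (total_pos + total_neg)
    precision = selected_pos / (selected_pos + selected_neg)
    recall = selected_pos / total_pos
    return baseline, precision, recall


# Counts from the text: 18 crystallizable and 1 non-crystallizable protein
# selected, out of 25 crystallizable and 39 non-crystallizable proteins.
baseline, precision, recall = rule_summary(18, 1, 25, 39)
```

With these counts, about 39% of the initial set is crystallizable, whereas about 95% of the selected set is, and the rule recovers 72% of the crystallizable proteins.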

Together these data suggest that, using the database of protein sequence information, and 
biochemical and biophysical properties, it is possible to derive sets of "rules" from primary 
sequence that are predictive of a given protein's biophysical and biochemical properties. 

WHAT IS CLAIMED IS: 



1. A method of determining the biochemical or biophysical properties of a protein, said 
method comprising the steps of: 

a) providing a database comprising protein sequence information and protein 
biochemical and/or biophysical properties, 

b) analyzing the database using a data-mining technique, 

c) correlating protein sequence, biochemical properties or biophysical properties, and 

d) analyzing the sequence of the protein using the correlations to determine its 
biochemical or biophysical properties. 

2. The method of claim 1, wherein the property being determined is a biophysical property. 

3. The method of claim 2, wherein the biophysical property is thermal stability, solubility, 
isoelectric point, pH stability, crystallizability, conditions of crystallization, aggregation 
state, heat capacity (ΔCp), resistance to chemical denaturation, resistance to proteolytic 
degradation, amide hydrogen exchange data, behavior on chromatographic matrices, 
electrophoretic mobility, resistance to degradation during mass spectrometry, and results 
obtained from nuclear magnetic resonance, X-ray crystallography, circular dichroism, 
light scattering, atomic adsorption, fluorescence, fluorescence quenching, mass 
spectroscopy, infrared spectroscopy, electron microscopy and atomic force microscopy. 

4. The method of claim 3, wherein the biophysical property is thermal stability. 

5. The method of claim 3, wherein the biophysical property is solubility. 

6. The method of claim 3, wherein the biophysical property is crystallizability. 

7. The method of claim 3, wherein the biophysical property is conditions of crystallization. 

8. The method of claim 3, wherein the biophysical property is isoelectric point. 

9. The method of claim 3, wherein the biophysical property is pH stability. 

10. The method of claim 3, wherein the biophysical property is aggregation state. 

11. The method of claim 3, wherein the biophysical property is heat capacity (ΔCp). 

12. The method of claim 3, wherein the biophysical property is resistance to chemical 
denaturation. 

13. The method of claim 3, wherein the biophysical property is resistance to proteolytic 
degradation. 

14. The method of claim 3, wherein the biophysical property is amide hydrogen exchange 
data. 

15. The method of claim 3, wherein the biophysical property is behavior on chromatographic 
matrices. 

16. The method of claim 3, wherein the biophysical property is electrophoretic mobility. 

17. The method of claim 3, wherein the biophysical property is resistance to degradation 
during mass spectrometry. 

18. The method of claim 3, wherein the biophysical property is the results obtained from 
nuclear magnetic resonance. 

19. The method of claim 3, wherein the biophysical property is the results obtained from 
X-ray crystallography. 

20. The method of claim 3, wherein the biophysical property is the results obtained from 
circular dichroism. 



21. The method of claim 3, wherein the biophysical property is the results obtained from light 
scattering. 

22. The method of claim 3, wherein the biophysical property is the results obtained from 
atomic adsorption. 

23. The method of claim 3, wherein the biophysical property is the results obtained from 
fluorescence. 

24. The method of claim 3, wherein the biophysical property is the results obtained from 
fluorescence quenching. 

25. The method of claim 3, wherein the biophysical property is the results obtained from 
mass spectroscopy. 

26. The method of claim 3, wherein the biophysical property is the results obtained from 
infrared spectroscopy. 

27. The method of claim 3, wherein the biophysical property is the results obtained from electron 
microscopy. 

28. The method of claim 3, wherein the biophysical property is the results obtained from 
atomic force microscopy. 

29. The method of claim 1, wherein the property being determined is a biochemical property. 

30. The method of claim 29, wherein the biochemical property is expressability, protein 
yield, small-molecule binding, subcellular localization, utility as a drug target, 
protein-protein interactions or protein-ligand interactions. 

31. The method of claim 30, wherein the biochemical property is small-molecule binding. 

32. The method of claim 30, wherein the biochemical property is protein yield. 

33. The method of claim 30, wherein the biochemical property is expressability. 

34. The method of claim 30, wherein the biochemical property is subcellular localization. 

35. The method of claim 30, wherein the biochemical property is utility as a drug target. 

36. The method of claim 30, wherein the biochemical property is protein-protein interactions. 

37. The method of claim 30, wherein the biochemical property is protein-ligand interactions. 

38. The method of claim 1, wherein the data-mining technique is selected from the group 
decision-tree analysis, case-based reasoning, Bayesian classifier, simple linear 
discriminant analysis, and support vector machines. 

39. The method of claim 38, wherein the data-mining technique is decision-tree analysis. 

40. The method of claim 38, wherein the data-mining technique is case-based reasoning. 

41. The method of claim 38, wherein the data-mining technique is Bayesian classifier. 

42. The method of claim 38, wherein the data-mining technique is simple linear discriminant 
analysis. 

43. The method of claim 38, wherein the data-mining technique is support vector machines. 

44. A method of optimizing high-throughput protein structure determination, said method 
comprising the steps of: 

a) providing a database comprising protein sequence information and protein 
biochemical and biophysical properties, 

b) analyzing the database using a data-mining technique, 

c) determining correlations between protein sequence and biochemical or biophysical 
properties, 

d) analyzing the sequence of a protein using said correlations to determine its 
biochemical or biophysical properties, and 

e) optimizing the throughput of the protein structure determination based on said 
biochemical or biophysical properties by modifying the experimental procedures 
and/or modifying the protein sequence. 

45. The method of claim 44, wherein the data-mining technique is selected from the group 
decision-tree analysis, case-based reasoning, Bayesian classifier, simple linear 
discriminant analysis, and support vector machines. 

46. The method of claim 45, wherein the data-mining technique is decision-tree analysis. 

47. The method of claim 45, wherein the data-mining technique is case-based reasoning. 

48. The method of claim 45, wherein the data-mining technique is Bayesian classifier. 

49. The method of claim 45, wherein the data-mining technique is simple linear discriminant 
analysis. 

50. The method of claim 45, wherein the data-mining technique is support vector machines. 

51. A method of optimizing high-throughput protein purification, said method comprising the 
steps of: 

a) providing a database comprising protein sequence information and protein 
biochemical and biophysical properties, 

b) analyzing the database using a data-mining technique, 

c) determining correlations between protein sequence and biochemical or biophysical 
properties, 

d) analyzing the sequence of a protein using the correlations to determine its 
biochemical or biophysical properties, and 

e) optimizing the throughput of the protein purification based on said biochemical or 
biophysical properties by modifying the experimental procedures and/or modifying 
the protein sequence. 

52. The method of claim 51, wherein the data-mining technique is selected from the group 
decision-tree analysis, case-based reasoning, Bayesian classifier, simple linear 
discriminant analysis, and support vector machines. 

53. The method of claim 52, wherein the data-mining technique is decision-tree analysis. 

54. The method of claim 52, wherein the data-mining technique is case-based reasoning. 

55. The method of claim 52, wherein the data-mining technique is Bayesian classifier. 

56. The method of claim 52, wherein the data-mining technique is simple linear discriminant 
analysis. 

57. The method of claim 52, wherein the data-mining technique is support vector machines. 

58. A method of optimizing high-throughput protein expression, said method comprising the 
steps of: 

a) providing a database comprising protein sequence information and protein 
biochemical and biophysical properties, 

b) analyzing the database using a data-mining technique, 

c) determining correlations between protein sequence and biochemical or biophysical 
properties, 

d) analyzing the sequence of the protein using the correlations to determine its 
biochemical or biophysical properties, and 

e) optimizing throughput of the protein expression based on said biochemical or 
biophysical properties by modifying the experimental procedures and/or modifying 
the protein sequence. 

59. The method of claim 58, wherein the data-mining technique is selected from the group 
decision-tree analysis, case-based reasoning, Bayesian classifier, simple linear 
discriminant analysis, and support vector machines. 

60. The method of claim 59, wherein the data-mining technique is decision-tree analysis. 

61. The method of claim 59, wherein the data-mining technique is case-based reasoning. 

62. The method of claim 59, wherein the data-mining technique is Bayesian classifier. 

63. The method of claim 59, wherein the data-mining technique is simple linear discriminant 
analysis. 

64. The method of claim 59, wherein the data-mining technique is support vector machines. 

65. A method of optimizing drug-target discovery, said method comprising the steps of: 

a) providing a database comprising protein sequence information and protein 
biochemical and biophysical properties, 

b) analyzing the database using a data-mining technique, 

c) determining correlations between protein sequence and biochemical or biophysical 
properties, 

d) analyzing the sequence of a protein using the correlations to determine its 
biochemical or biophysical properties, and 

e) optimizing drug-target discovery based on said biochemical or biophysical 
properties by modifying the experimental procedures and/or modifying the protein 
sequence. 

66. A method of screening proteins for drug-target discovery, said method comprising the 
steps of: 

a) providing a database comprising protein sequence information and protein 
biochemical and biophysical properties, 

b) analyzing the database using a data-mining technique, 

c) determining correlations between protein sequence and biochemical or biophysical 
properties, 

d) analyzing the sequence of the protein using the correlations to determine its 
biochemical or biophysical properties, and 

e) selecting proteins for analysis as a drug target based on their predicted biochemical 
and/or biophysical properties. 

ABSTRACT 

A method for determining relationships among a protein's 
biochemical properties, biophysical properties and amino acid sequence is provided. The 
invention further provides a database of protein sequence information and experimentally 
determined protein properties. This database is analyzed using data-mining techniques to find 
correlations among protein sequence information, biochemical properties and biophysical 
properties. Using the empirical correlations obtained from the data-mining techniques, the 
properties of new proteins are determined given their amino acid sequence information alone or 
using a combination of the sequence information and one or more properties. 



Figure 1 